---
title: "Code Chunker"
description: "Split code into chunks based on code structure"
icon: "laptop"
---

The `CodeChunker` splits code into chunks based on its structure, leveraging Abstract Syntax Trees (ASTs) to create contextually relevant segments.
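
To make the idea concrete, here is a minimal sketch of structure-aware chunking. It is not Chonkie's implementation: it assumes Python input, uses the standard-library `ast` module rather than tree-sitter, and measures size in characters. The function name `chunk_by_top_level_nodes` is hypothetical.

```python
import ast


def chunk_by_top_level_nodes(source: str, chunk_size: int) -> list[str]:
    """Greedily pack top-level AST nodes into chunks of at most chunk_size characters.

    A single node longer than chunk_size is kept whole rather than split.
    """
    lines = source.splitlines(keepends=True)
    chunks: list[str] = []
    current = ""
    consumed = 0  # number of source lines already assigned to a segment
    for node in ast.parse(source).body:
        # Take everything from the end of the previous node through this node,
        # so blank lines, comments, and decorators stay attached to the node.
        segment = "".join(lines[consumed:node.end_lineno])
        consumed = node.end_lineno
        if current and len(current) + len(segment) > chunk_size:
            chunks.append(current)
            current = ""
        current += segment
    current += "".join(lines[consumed:])  # trailing comments or blank lines
    if current:
        chunks.append(current)
    return chunks


source = (
    "import os\n"
    "\n"
    "def hello_world():\n"
    "    print('Hello, Chonkie!')\n"
    "\n"
    "class MyClass:\n"
    "    def __init__(self):\n"
    "        self.value = 42\n"
)
chunks = chunk_by_top_level_nodes(source, chunk_size=40)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ({len(chunk)} chars) ---")
    print(chunk, end="")
```

Because the sketch only cuts between top-level nodes, each chunk remains valid Python on its own, and joining the chunks reproduces the original source.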

## API Reference

To use the `CodeChunker` via the API, check out the [API reference documentation](../../api/chunkers/code-chunker).

## Installation

`CodeChunker` requires additional dependencies for code parsing. Install them with:

```bash
pip install "chonkie[code]"
```

<Info>
  For installation instructions, see the [Installation
  Guide](/oss/installation).
</Info>

## Initialization

```python
from chonkie import CodeChunker

# Basic initialization with default parameters
chunker = CodeChunker(
    language="python",      # Specify the programming language
    tokenizer="character",  # Default tokenizer (or use "gpt2", etc.)
    chunk_size=2048,        # Maximum tokens per chunk
    include_nodes=False     # Optionally include AST nodes in output
)

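# Using a custom tokenizer (kept in a separate variable so the
# Python chunker above remains available for the examples below)
from tokenizers import Tokenizer
custom_tokenizer = Tokenizer.from_pretrained("your-tokenizer")
js_chunker = CodeChunker(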
    language="javascript",
    tokenizer=custom_tokenizer,
    chunk_size=2048
)
```

## Parameters

<ParamField path="language" type="str" required>
  The programming language of the code. Accepts languages supported by
  `tree-sitter-language-pack`.
</ParamField>

<ParamField
  path="tokenizer"
  type="Union[str, Callable, Any]"
  default="character"
>
  Tokenizer or token counting function to use for measuring chunk size.
</ParamField>

<ParamField path="chunk_size" type="int" default="2048">
  Maximum number of tokens per chunk.
</ParamField>

<ParamField path="include_nodes" type="bool" default="False">
  Whether to include AST node information. Note that the base `Chunk` type does
  not store node information.
</ParamField>
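
As an illustration of the "token counting function" option for `tokenizer`, the sketch below defines a simple counter. It assumes such a function takes a string and returns an integer; the name `count_code_tokens` and its counting rule are hypothetical, not something Chonkie provides.

```python
import re


def count_code_tokens(text: str) -> int:
    """Count each identifier or number as one token, and each
    other non-whitespace character (punctuation, operators) as one token."""
    return len(re.findall(r"\w+|[^\w\s]", text))


print(count_code_tokens("def add(a, b): return a + b"))  # 12
```

Going by the parameter description above, a function like this could be passed as `tokenizer=count_code_tokens`, in which case `chunk_size` would be measured in these units rather than in characters.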

## Usage

### Single Code Chunking

```python
code = """
def hello_world():
    print("Hello, Chonkie!")

class MyClass:
    def __init__(self):
        self.value = 42
"""
chunks = chunker.chunk(code)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    # The base Chunk type has no `lang` or `nodes` fields (see Return Type)
    print(f"Span: {chunk.start_index}-{chunk.end_index}")
```

### Batch Chunking

```python
# Each snippet should be written in the language the chunker was created for
codes = [
    "def func1():\n    pass",
    "x = 10\ndef add(a, b):\n    return a + b"
]
batch_chunks = chunker.chunk_batch(codes)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")
```

### Using as a Callable

```python
# Single code string
chunks = chunker("def greet(name):\n    print(f'Hello, {name}')")

# Multiple code strings (same language as the chunker)
batch_chunks = chunker(["def a():\n    return 1", "def b():\n    return 2"])
```

## Return Type

`CodeChunker` returns chunks as `Chunk` objects:

```python
@dataclass
class Chunk:
    text: str           # The chunk text (code snippet)
    start_index: int    # Starting position in original code
    end_index: int      # Ending position in original code
    token_count: int    # Number of tokens in chunk
    context: Optional[Context] = None    # Optional context metadata
    embedding: Union[List[float], "np.ndarray", None] = None  # Optional embedding vector
```

<Note>
  As of version 1.3.0, `CodeChunker` returns the base `Chunk` type instead of the
  specialized `CodeChunk` type. This simplifies integration with other chunkers
  and refineries.
</Note>
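
The positional fields can be used to map a chunk back to line numbers in the original file. The sketch below assumes `start_index` and `end_index` are character offsets with an exclusive end; `ChunkLike` and `line_span` are hypothetical names used so the example runs without Chonkie installed.

```python
from dataclasses import dataclass


@dataclass
class ChunkLike:
    """Hypothetical stand-in carrying the positional fields of Chunk."""
    text: str
    start_index: int
    end_index: int


def line_span(code: str, chunk: ChunkLike) -> tuple[int, int]:
    """Return the 1-based (first_line, last_line) covered by the chunk."""
    first = code.count("\n", 0, chunk.start_index) + 1
    # Use the last character inside the chunk so that a trailing
    # newline does not push the span onto the following line.
    last_char = max(chunk.start_index, chunk.end_index - 1)
    last = code.count("\n", 0, last_char) + 1
    return first, last


code = "def a():\n    pass\n\ndef b():\n    pass\n"
start = code.index("def b")
chunk = ChunkLike(text=code[start:], start_index=start, end_index=len(code))
print(line_span(code, chunk))  # (4, 5)
```

If the offsets turn out to be inclusive or byte-based in your version, the slice `code[chunk.start_index:chunk.end_index]` will not equal `chunk.text`, which is a quick way to check the assumption.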
